NetNews Offline 2

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Offline 2 / NetNews Offline Volume 2.iso / news / comp / lang / c-part1 / 3493 < prev next >

Wrap

Internet Message Format | 1996-08-05 | 3.5 KB

Path: mail2news.demon.co.uk!genesis.demon.co.uk From: Lawrence Kirby <fred@genesis.demon.co.uk> Newsgroups: comp.lang.c Subject: Re: Matrix Multiplication Date: Mon, 29 Jan 96 13:43:44 GMT Organization: none Message-ID: <822923024snz@genesis.demon.co.uk> References: <1996Jan22.110440@gamma.ntu.ac.sg> <4ecjv4$fn2@tamarack.cs.mtu.edu> <822792812snz@genesis.demon.co.uk> <DLwzv9.9wu@cwi.nl> Reply-To: fred@genesis.demon.co.uk X-NNTP-Posting-Host: genesis.demon.co.uk X-Newsreader: Demon Internet Simple News v1.27 X-Mail2News-Path: genesis.demon.co.uk In article <DLwzv9.9wu@cwi.nl> dik@cwi.nl "Dik T. Winter" writes: > > >Consider optimizing for cache performance. >[Omitting the initialization] > > >for ( i = 0; i < N; i++ ) > > > for ( j = 0; j < N; j++ ) > > > for ( k = 0; k < N; k++ ) > > > c[i][j] = c[i][j] + a[i][k]*b[k][j]; > > > > Or how about: > > double sum = 0.0; >... > > c[i][j] = sum; > > You compiler might be able to perform that optimisation for itself but > > aliasing issues may make the analysis difficult. > >In this case aliasing might make it difficult indeed or even impossible, >and doing it is indeed an improvement. But the proposed version is *not* >optimal for caching; the optimal way to do it is to make the 'j' loop >innermost: > for ( i = 0; i < N; i++) > for (k = 0; k < N; k++) > for (j = 0; j < N; j++) > c[i][j] = c[i][j] + a[i][k]*b[k][j]; >[and this is also optimal on vector processors]. >The reason is that if at some stage c[i][j] (or b[k][j]) has a cache miss >a whole cache line is brought in, and this includes some elements of 'c' and >'b' that are needed on next steps through the loop. Putting k innermost >gives the benefit only to 'a'. (But it will also depend a lot on the >cache replacement strategy used by the processor.) Right. Since in my version c is only ever written to cache line reads are not necessarily an issue. It is also outside the innermost loop so (in your case where N=1000) a cache miss compared to a 1000 iteration loop (with 1000 potential cache misses for b) should be insignificant. Indeed the chance of being able to accumulate the result in the inner loop in a register is likely to be more significant. >Timings on an SGI with a 100 MHz MIPS R4k (in seconds with N = 1000, 10 >multiplications): >i innermost 11.08 >k innermost 5.92 >k innermost + scalar 4.60 >j innermost 2.59 I guess you'd need to analyse the object code to see what's going on there. Still, this proves that where performance is critical there is no substitute for profiling the alternatives on the target system. It might be worth trying: /* zero all elements of c here */ for (i = 0; i < N; i++) { for (k = 0; k < N; k++) { const double aik = a[i][k]; for (j = 0; j < N; j++) c[i][j] += aik * b[k][j]; } } since that is an optimisation the compiler definitely can't make for itself if it can't tell that a and c are distinct. Actually this makes it clear why your version works fast - *everything* in the inner loop caches well. The only disadvantage is that the c update requires an 'external' read-modify-write sequence. The performance of that will depend on how write caching is implemented. Maybe C could do as well as FORTRAN in this case. -- ----------------------------------------- Lawrence Kirby | fred@genesis.demon.co.uk Wilts, England | 70734.126@compuserve.com -----------------------------------------